Fingerprints for highly similar streams

Authors

  • Yoram Bachrach
  • Ely Porat
Abstract

We propose an approach for approximating the Jaccard similarity of two streams, J(A,B) = |A∩B| / |A∪B|, for domains where this similarity is known to be high. Our method is based on a reduction from Jaccard similarity to F2 norm estimation, for which there exists a sketch that is efficient in both size and compute time, and which we augment with a sampling technique. Our approach offers an improvement in the fingerprint size that is quadratic in the degree of similarity between the streams. More precisely, to approximate the Jaccard similarity up to a multiplicative factor of ε with confidence δ, it suffices to take a fingerprint of size O(ln(1/δ) (1−t)^2 log(1/(1−t))), where t is the known minimal Jaccard similarity between the streams. Further, computing our fingerprint can be done in time O(1) per element in the stream.
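The reduction mentioned in the abstract can be illustrated with a small sketch. This is not the authors' construction: it is a plain tug-of-war (AMS-style) F2 sketch, without the sampling step, and it spends O(k) time per element rather than O(1). It assumes each stream lists distinct items, so that the squared distance between the two indicator vectors equals |AΔB|, and it uses the identity J = (|A| + |B| − |AΔB|) / (|A| + |B| + |AΔB|). The names `sign`, `fingerprint`, `jaccard_from_symdiff` and `estimate_jaccard` are hypothetical, and the hash-based signs are a convenience, not a 4-wise independent family.

```python
import hashlib


def sign(seed, item):
    """Pseudo-random +1/-1 for (seed, item); hash-based stand-in for a
    4-wise independent sign function (an assumption of this sketch)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def fingerprint(stream, k):
    """Return (stream size, k signed counters) for a stream of distinct items."""
    counters = [0] * k
    size = 0
    for item in stream:
        size += 1
        for i in range(k):
            counters[i] += sign(i, item)
    return size, counters


def jaccard_from_symdiff(size_a, size_b, symdiff):
    """J = (|A| + |B| - |A delta B|) / (|A| + |B| + |A delta B|), clamped to [0, 1]."""
    total = size_a + size_b
    if total + symdiff == 0:
        return 1.0  # both streams empty
    return max(0.0, min(1.0, (total - symdiff) / (total + symdiff)))


def estimate_jaccard(fp_a, fp_b):
    """Estimate J(A, B) from two fingerprints built with the same k."""
    (size_a, counters_a), (size_b, counters_b) = fp_a, fp_b
    # Items in both streams cancel in the difference, so the mean squared
    # counter difference estimates F2 of the indicator difference, |A delta B|.
    f2 = sum((x - y) ** 2 for x, y in zip(counters_a, counters_b)) / len(counters_a)
    return jaccard_from_symdiff(size_a, size_b, f2)
```

For example, `estimate_jaccard(fingerprint(range(100), 1000), fingerprint(range(5, 105), 1000))` approximates the true similarity 95/105. Because only the symmetric difference contributes to the estimate, the absolute error shrinks as the streams become more similar, which is the intuition behind smaller fingerprints for high t.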

Similar resources

A Study of Maroon Basin Neotectonics and Erosion Using Geomorphometric Techniques

Extended abstract. 1. Introduction. In regions with active tectonics, the geometry and evolution of river systems are sensitive to the surface uplift rate. The simply folded belt of Zagros developed during the late Cenozoic era and is still active, as it is affected by the Arabian-Iranian shortening processes at the plate boundary, since ...

The Candidate Key Protocol for Generating Secret Shared Keys from Similar Sensor Data Streams

Secure communication over wireless channels necessitates authentication of communication partners to prevent man-in-the-middle attacks. For spontaneous interaction between independent, mobile devices, no a priori information is available for authentication purposes. However, traditional approaches based on manual password input or verification of key fingerprints do not scale to tens to hundred...

Analysis of Gene Expression Data Spring Semester , 2004 Lecture 4 : Mars 17 , 2005

CLICK (CLuster Identification via Connectivity Kernels) is a new algorithm for clustering [14]. The input for CLICK is the gene expression matrix. Each row of this matrix is an " expression fingerprint " for a single gene. The columns are specific conditions under which gene expression is measured. The CLICK algorithm attempts to find a partitioning of the set of elements into clusters, so that...

Priority Setting Meets Multiple Streams: A Match to Be Further Examined?; Comment on “Introducing New Priority Setting and Resource Allocation Processes in a Canadian Healthcare Organization: A Case Study Analysis Informed by Multiple Streams Theory”

With demand for health services continuing to grow as populations age and new technologies emerge to meet health needs, healthcare policy-makers are under constant pressure to set priorities, ie, to make choices about the health services that can and cannot be funded within available resources. In a recent paper, Smith et al apply an influential policy studies framework – Kingdon’s multiple str...

Fingerprint Synthesis: Evaluating Fingerprint Search at Scale

A database of a large number of fingerprint images is highly desired for designing and evaluating large scale fingerprint search algorithms. Compared to collecting a large number of real fingerprints, which is very costly in terms of time, effort and expense, and also involves stringent privacy issues, synthetic fingerprints can be generated at low cost and does not have any privacy issues to d...

Journal:
  • Inf. Comput.

Volume 244

Publication date: 2015